:orphan:
Core Basics 1: Train, Evaluate and Deploy a Classifier
======================================================
In this lesson we will learn how to train, evaluate and deploy
classifiers with Khiops.
Make sure you have installed Khiops and Khiops Visualization.
We start by importing Khiops and defining some helper functions:
.. code:: ipython3

    import os
    import platform
    import subprocess

    from khiops import core as kh


    # Define peek helper function
    def peek(file_path, n=10):
        """Shows the first n lines of a file"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            for line in file.readlines()[:n]:
                print(line, end="")
        print("")


    # If there are any issues, you may print Khiops status with the following command:
    # kh.get_runner().print_status()
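The ``peek`` helper reads the whole file before keeping the first ``n``
lines, which is fine for the small sample files used here. For large data
files, a variant based on ``itertools.islice`` stops reading after ``n``
lines. The following is a minimal sketch; the name ``head`` and the
throwaway demo file are illustrative assumptions, not part of Khiops.

.. code:: ipython3

    import itertools
    import os
    import tempfile


    def head(file_path, n=10):
        """Returns the first n lines of a file without reading the rest"""
        with open(file_path, encoding="utf8", errors="replace") as file:
            return list(itertools.islice(file, n))


    # Demo on a throwaway file with 1000 lines
    with tempfile.TemporaryDirectory() as tmp_dir:
        demo_path = os.path.join(tmp_dir, "demo.txt")
        with open(demo_path, "w", encoding="utf8") as demo_file:
            demo_file.writelines(f"line {i}\n" for i in range(1000))
        first_lines = head(demo_path, n=3)

    print("".join(first_lines), end="")
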
Training a Classifier
---------------------
We’ll train a classifier for the ``Iris`` dataset. This is a classical
dataset containing the data of different plants belonging to the genus
*Iris*. It contains 150 records, 50 for each of three variants of
*Iris*: *Setosa*, *Virginica* and *Versicolor*. The records for each
sample contain the length and width of its petal and sepal. The standard
task for this dataset is to construct a classifier for the type of
*Iris* taking as inputs the length and width characteristics.
Now to train a classifier with Khiops, we use two types of files:

- A plain-text delimited data file (for example a ``csv`` file)
- A *dictionary* file which describes the schema of the above data table
  (``.kdic`` file extension)

Let’s store the locations of these files for the ``Iris`` dataset in
variables and then take a look at their contents:
.. code:: ipython3
iris_kdic = os.path.join(kh.get_samples_dir(), "Iris", "Iris.kdic")
iris_data_file = os.path.join(kh.get_samples_dir(), "Iris", "Iris.txt")
print(f"Iris dictionary file: {iris_kdic}")
peek(iris_kdic)
print(f"Iris data file: {iris_data_file}\n")
peek(iris_data_file)
.. parsed-literal::
Iris dictionary file: /github/home/khiops_data/samples/Iris/Iris.kdic
Dictionary Iris
{
Numerical SepalLength ;
Numerical SepalWidth ;
Numerical PetalLength ;
Numerical PetalWidth ;
Categorical Class ;
};
Iris data file: /github/home/khiops_data/samples/Iris/Iris.txt
SepalLength SepalWidth PetalLength PetalWidth Class
5.1 3.5 1.4 0.2 Iris-setosa
4.9 3.0 1.4 0.2 Iris-setosa
4.7 3.2 1.3 0.2 Iris-setosa
4.6 3.1 1.5 0.2 Iris-setosa
5.0 3.6 1.4 0.2 Iris-setosa
5.4 3.9 1.7 0.4 Iris-setosa
4.6 3.4 1.4 0.3 Iris-setosa
5.0 3.4 1.5 0.2 Iris-setosa
4.4 2.9 1.4 0.2 Iris-setosa
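
To make the link between the two files concrete, the following sketch
parses the dictionary text shown above and checks that every column of the
data file header is declared in it. It is only an illustration:
``read_variable_types`` is a hypothetical helper (not part of the Khiops
API), it handles only the ``Numerical`` and ``Categorical`` declarations
seen here rather than the full dictionary syntax, and it assumes that the
data file is tab-delimited.

.. code:: ipython3

    import csv
    import io
    import re

    # Contents copied from the output above; the tab delimiter is an assumption
    dictionary_text = """Dictionary Iris
    {
    Numerical SepalLength ;
    Numerical SepalWidth ;
    Numerical PetalLength ;
    Numerical PetalWidth ;
    Categorical Class ;
    };
    """
    data_text = (
        "SepalLength\tSepalWidth\tPetalLength\tPetalWidth\tClass\n"
        "5.1\t3.5\t1.4\t0.2\tIris-setosa\n"
        "4.9\t3.0\t1.4\t0.2\tIris-setosa\n"
    )

    # Matches simple declarations such as "Numerical SepalLength ;"
    VARIABLE_LINE = re.compile(r"^\s*(Numerical|Categorical)\s+(\w+)\s*;")


    def read_variable_types(kdic_text):
        """Maps each variable name to its declared type (toy parser)"""
        variable_types = {}
        for line in kdic_text.splitlines():
            match = VARIABLE_LINE.match(line)
            if match:
                variable_types[match.group(2)] = match.group(1)
        return variable_types


    variable_types = read_variable_types(dictionary_text)
    header = next(csv.reader(io.StringIO(data_text), delimiter="\t"))
    undeclared = [name for name in header if name not in variable_types]

    print(variable_types)
    print(f"Undeclared columns: {undeclared}")
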
Note that the *Iris* variant information is in the column ``Class``. Now
let’s specify the path to the analysis report file.
.. code:: ipython3
analysis_report_file_path_Iris = os.path.join("exercises", "Iris", "AnalysisReport.khj")
print(f"Iris analysis report file path: {analysis_report_file_path_Iris}")
.. parsed-literal::
Iris analysis report file path: exercises/Iris/AnalysisReport.khj
We are now ready to train the classifier with the Khiops function
``train_predictor``. This function returns a tuple containing the
locations of two files:

- the modeling report (``AnalysisReport.khj``): A JSON file containing
  information such as the informativeness of each variable, the variables
  selected for the model and performance metrics. It is saved at the path
  stored in the ``analysis_report_file_path_Iris`` variable that we just
  defined.
- the model’s *dictionary* file (``AnalysisReport.model.kdic``): An
  enriched version of the initial dictionary file that contains the
  model. It can be used to make predictions on new data.

.. code:: ipython3
iris_report, iris_model_kdic = kh.train_predictor(
iris_kdic,
dictionary_name="Iris",
data_table_path=iris_data_file,
target_variable="Class",
analysis_report_file_path=analysis_report_file_path_Iris,
max_trees=0, # by default Khiops constructs 10 decision tree variables
)
print(f"Iris report file: {iris_report}")
print(f"Iris modeling dictionary: {iris_model_kdic}")
.. parsed-literal::
Iris report file: exercises/Iris/AnalysisReport.khj
Iris modeling dictionary: exercises/Iris/AnalysisReport.model.kdic
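
The two paths printed above follow a visible naming pattern: the model
dictionary sits next to the report and replaces the ``.khj`` extension with
``.model.kdic``. The sketch below reproduces that pattern with ``pathlib``.
The helper ``expected_model_dictionary_path`` is hypothetical and only
mirrors the output shown here; in practice, rely on the second element of
the tuple returned by ``train_predictor``.

.. code:: ipython3

    from pathlib import PurePosixPath


    def expected_model_dictionary_path(report_path):
        """Guesses the model dictionary path from the report path (illustrative only)"""
        return PurePosixPath(report_path).with_suffix(".model.kdic")


    guessed_path = expected_model_dictionary_path("exercises/Iris/AnalysisReport.khj")
    print(guessed_path)
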
Note that ``iris_report`` (the first element of the tuple returned by
``train_predictor``) is identical to ``analysis_report_file_path_Iris``.

In the next sections, we’ll use the file at ``iris_report`` to assess
the model’s performance and the file at ``iris_model_kdic`` to deploy
it. Now we can have a look at the report with the Khiops Visualization
app:
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(iris_report)
Exercise
~~~~~~~~
We’ll repeat the previous steps on the ``Adult`` dataset. This dataset
contains characteristics of the adult population in the USA, such as age,
gender and education. The task is to predict the variable ``class``,
which indicates whether the individual earns ``more`` or ``less`` than
50,000 dollars.
Let’s start by storing the paths for the ``Adult`` dataset in
variables:
.. code:: ipython3
adult_kdic = os.path.join(kh.get_samples_dir(), "Adult", "Adult.kdic")
adult_data_file = os.path.join(kh.get_samples_dir(), "Adult", "Adult.txt")
Print the file locations and use the function ``peek`` to list their contents
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
print(f"Adult dictionary file: {adult_kdic}")
peek(adult_kdic)
print(f"Adult data file: {adult_data_file}\n")
peek(adult_data_file)
.. parsed-literal::
Adult dictionary file: /github/home/khiops_data/samples/Adult/Adult.kdic
Dictionary Adult
{
Categorical Label ;
Numerical age ;
Categorical workclass ;
Numerical fnlwgt ;
Categorical education ;
Numerical education_num ;
Categorical marital_status ;
Adult data file: /github/home/khiops_data/samples/Adult/Adult.txt
Label age workclass fnlwgt education education_num marital_status occupation relationship race sex capital_gain capital_loss hours_per_week native_country class
1 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States less
2 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States less
3 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States less
4 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States less
5 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba less
6 37 Private 284582 Masters 14 Married-civ-spouse Exec-managerial Wife White Female 0 0 40 United-States less
7 49 Private 160187 9th 5 Married-spouse-absent Other-service Not-in-family Black Female 0 0 16 Jamaica less
8 52 Self-emp-not-inc 209642 HS-grad 9 Married-civ-spouse Exec-managerial Husband White Male 0 0 45 United-States more
9 31 Private 45781 Masters 14 Never-married Prof-specialty Not-in-family White Female 14084 0 50 United-States more
We now specify the path to the analysis report file for this exercise:
.. code:: ipython3
analysis_report_file_path_Adult = os.path.join(
"exercises", "Adult", "AnalysisReport.khj"
)
print(f"Adult analysis report file path: {analysis_report_file_path_Adult}")
.. parsed-literal::
Adult analysis report file path: exercises/Adult/AnalysisReport.khj
Train a classifier for the ``Adult`` database
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note the name of the target variable is ``class`` (**in lower case!**).
Do not forget to set ``max_trees=0``. Save the resulting file locations
into the variables ``adult_report`` and ``adult_model_kdic`` and print
them.
.. code:: ipython3
adult_report, adult_model_kdic = kh.train_predictor(
adult_kdic,
dictionary_name="Adult",
data_table_path=adult_data_file,
target_variable="class",
analysis_report_file_path=analysis_report_file_path_Adult,
max_trees=0,
)
print(f"Adult report file: {adult_report}")
print(f"Adult modeling dictionary file: {adult_model_kdic}")
.. parsed-literal::
Adult report file: exercises/Adult/AnalysisReport.khj
Adult modeling dictionary file: exercises/Adult/AnalysisReport.model.kdic
Inspect the results with the Khiops Visualization app
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
# To visualize uncomment the line below
# kh.visualize_report(adult_report)
Accessing a Classifier’s Basic Evaluation Metrics
-------------------------------------------------
We access the classifier’s evaluation metrics by loading the file at
``iris_report`` with the Khiops function ``read_analysis_results_file``:
.. code:: ipython3
iris_results = kh.read_analysis_results_file(iris_report)
print(type(iris_results))
.. parsed-literal::

    <class 'khiops.core.analysis_results.AnalysisResults'>

The resulting object is an instance of the ``AnalysisResults`` class.
The model evaluation reports are stored in its
``train_evaluation_report`` and ``test_evaluation_report`` attributes
which are of class ``EvaluationReport``.
.. code:: ipython3
iris_train_eval = iris_results.train_evaluation_report
iris_test_eval = iris_results.test_evaluation_report
print(type(iris_train_eval))
print(type(iris_test_eval))
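.. parsed-literal::

    <class 'khiops.core.analysis_results.EvaluationReport'>
    <class 'khiops.core.analysis_results.EvaluationReport'>
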
We access the default predictor’s metrics with the
``get_snb_performance`` method of the evaluation report objects:
.. code:: ipython3
iris_train_performance = iris_train_eval.get_snb_performance()
iris_test_performance = iris_test_eval.get_snb_performance()
These objects are of class ``PredictorPerformance``. They expose the
``accuracy`` and ``auc`` attributes:
.. code:: ipython3
print(f"Iris train accuracy: {iris_train_performance.accuracy}")
print(f"Iris test accuracy: {iris_test_performance.accuracy}")
print("")
print(f"Iris train AUC: {iris_train_performance.auc}")
print(f"Iris test AUC: {iris_test_performance.auc}")
.. parsed-literal::
Iris train accuracy: 0.980952
Iris test accuracy: 0.955556
Iris train AUC: 0.998134
Iris test AUC: 0.984362
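
As a sanity check, these accuracies can be translated back into counts of
correctly classified records. The sketch below assumes that Khiops used a
70%/30% train/test split of the 150 ``Iris`` records (an assumption about
the default settings, not something stated in this lesson). Under that
assumption the reported values correspond to 103 of 105 training records
and 43 of 45 test records.

.. code:: ipython3

    total_records = 150
    train_fraction = 0.70  # assumed train/test split, see the text above

    train_records = round(total_records * train_fraction)
    test_records = total_records - train_records


    def correct_predictions(accuracy, n_records):
        """Converts an accuracy back into a count of correctly classified records"""
        return round(accuracy * n_records)


    # Accuracies copied from the output above
    train_correct = correct_predictions(0.980952, train_records)
    test_correct = correct_predictions(0.955556, test_records)

    print(f"Train: {train_correct}/{train_records} = {train_correct / train_records:.6f}")
    print(f"Test: {test_correct}/{test_records} = {test_correct / test_records:.6f}")
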
Exercise
~~~~~~~~
Read the contents of the file at ``adult_report`` for the Adult analysis and print its type
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
adult_results = kh.read_analysis_results_file(adult_report)
type(adult_results)
.. parsed-literal::
khiops.core.analysis_results.AnalysisResults
Save the evaluation reports of the ``Adult`` classification to the variables ``adult_train_eval`` and ``adult_test_eval``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
adult_train_eval = adult_results.train_evaluation_report
adult_test_eval = adult_results.test_evaluation_report
Show the model’s train and test accuracies and AUCs
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
.. code:: ipython3
adult_train_performance = adult_train_eval.get_snb_performance()
adult_test_performance = adult_test_eval.get_snb_performance()
print(f"Adult train accuracy: {adult_train_performance.accuracy}")
print(f"Adult test accuracy: {adult_test_performance.accuracy}")
print("")
print(f"Adult train AUC: {adult_train_performance.auc}")
print(f"Adult test AUC: {adult_test_performance.auc}")
.. parsed-literal::
Adult train accuracy: 0.86947
Adult test accuracy: 0.86592
Adult train AUC: 0.926153
Adult test AUC: 0.921511
Deploying a Classifier
----------------------
We are going to deploy the ``Iris`` classifier we have just trained. Here
we apply it to the same dataset it was trained on; normally we would
deploy it on new data. We saved the model in the file at
``iris_model_kdic``. Model dictionary files are usually large and hard to
read, so you should know what you are doing before editing one. Let’s
take a quick look at its contents:
.. code:: ipython3
peek(iris_model_kdic, 25)
.. parsed-literal::
#Khiops 11.0.0-b.0
Dictionary SNB_Iris
{
Unused Numerical SepalLength ;
Unused Numerical SepalWidth ;
Unused Numerical PetalLength ;
Unused Numerical PetalWidth ;
Unused Categorical Class ;
Unused Structure(DataGrid) VClass = DataGrid(ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 32, 35)) ;
Unused Structure(DataGrid) PPetalLength = DataGrid(IntervalBounds(3.15, 4.75, 5.15), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 1, 26, 5, 0, 0, 0, 9, 26)) ; // DataGrid(PetalLength, Class)
Unused Structure(DataGrid) PPetalWidth = DataGrid(IntervalBounds(0.75, 1.75), ValueSetC("Iris-setosa", "Iris-versicolor", "Iris-virginica"), Frequencies(38, 0, 0, 0, 31, 1, 0, 2, 33)) ; // DataGrid(PetalWidth, Class)
Unused Structure(Classifier) SNBClass = SNBClassifier(Vector(0.453125, 0.5), DataGridStats(PPetalLength, PetalLength), DataGridStats(PPetalWidth, PetalWidth), VClass) ;
Categorical PredictedClass = TargetValue(SNBClass) ;
Unused Numerical ScoreClass = TargetProb(SNBClass) ;
Numerical `ProbClassIris-setosa` = TargetProbAt(SNBClass, "Iris-setosa") ;
Numerical `ProbClassIris-versicolor` = TargetProbAt(SNBClass, "Iris-versicolor") ;
Numerical `ProbClassIris-virginica` = TargetProbAt(SNBClass, "Iris-virginica") ;
};
Note that the modeling dictionary contains 4 used variables:

- ``PredictedClass``: The class with the highest probability according to
  the model
- ``ProbClassIris-setosa``, ``ProbClassIris-versicolor``,
  ``ProbClassIris-virginica``: The probabilities of each class according
  to the model

These will be the columns of the table obtained after deploying the
model. This table will be saved at ``iris_deployment_file``.
.. code:: ipython3
iris_deployment_file = os.path.join("exercises", "Iris", "iris_deployment.txt")
kh.deploy_model(
iris_model_kdic,
dictionary_name="SNB_Iris",
data_table_path=iris_data_file,
output_data_table_path=iris_deployment_file,
)
peek(iris_deployment_file)
.. parsed-literal::
PredictedClass ProbClassIris-setosa ProbClassIris-versicolor ProbClassIris-virginica
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
Iris-setosa 0.9935139877 0.004559173379 0.001926838879
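
Each deployed row can be checked for consistency: the probabilities should
sum to approximately 1 and ``PredictedClass`` should be the class with the
highest probability. The sketch below does this for the first row of the
output above. It assumes that the deployment file is tab-delimited, and
``class_probabilities`` and ``most_probable_class`` are hypothetical
helpers, not Khiops functions.

.. code:: ipython3

    import csv
    import io

    # Header and first row copied from the output above (tab delimiter assumed)
    deployment_text = (
        "PredictedClass\tProbClassIris-setosa\t"
        "ProbClassIris-versicolor\tProbClassIris-virginica\n"
        "Iris-setosa\t0.9935139877\t0.004559173379\t0.001926838879\n"
    )

    PROB_PREFIX = "ProbClass"


    def class_probabilities(row):
        """Extracts {class name: probability} from the ProbClass* columns of a row"""
        return {
            name[len(PROB_PREFIX):]: float(value)
            for name, value in row.items()
            if name.startswith(PROB_PREFIX)
        }


    def most_probable_class(row):
        """Returns the class whose probability is the highest in the row"""
        probabilities = class_probabilities(row)
        return max(probabilities, key=probabilities.get)


    first_row = next(csv.DictReader(io.StringIO(deployment_text), delimiter="\t"))
    probability_sum = sum(class_probabilities(first_row).values())

    print(f"Most probable class: {most_probable_class(first_row)}")
    print(f"Sum of probabilities: {probability_sum:.6f}")
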
Exercise
~~~~~~~~
Use the ``deploy_model`` function to deploy the model stored in the file at ``adult_model_kdic``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Which columns are deployed?
.. code:: ipython3
adult_deployment_file = os.path.join("exercises", "Adult", "adult_deployment.txt")
kh.deploy_model(
adult_model_kdic,
dictionary_name="SNB_Adult",
data_table_path=adult_data_file,
output_data_table_path=adult_deployment_file,
)
peek(adult_deployment_file)
.. parsed-literal::
Predictedclass Probclassless Probclassmore
less 0.9999926806 7.319380182e-06
more 0.4107568382 0.5892431618
less 0.9622314248 0.03776857516
less 0.9172269213 0.08277307874
less 0.5833340928 0.4166659072
more 0.2619499457 0.7380500543
less 0.9940101932 0.005989806772
more 0.4199564537 0.5800435463
more 0.001247535351 0.9987524646